ABSTRACT

This  paper  presents  a  software  system  for  online  image- 
based  terrain  classification  that  mimics  a  human 
supervisor’s  segmentation  and  classification  of  training 
images  into  “Go”  and  “NoGo”  regions.  The  system 
identifies  a  set  of  image  chips  (or  exemplars)  in  the 
training  images  that  span  the  range  of  terrain 
appearance.  It  then  uses  the  exemplars  to  segment 
novel  images  and  assign  a  Go/NoGo  classification. 
System  performance  is  compared  to  that  obtained  via 
offline  fuzzy  c-means  clustering. 

Key  words:  terrain  classification,  computer  vision, 
machine  learning,  exemplar  memory 

I.  Introduction 

Monocular and stereo video cameras continue to be the
most practical vision sensors for small inexpensive
robots. Unfortunately, unstructured vision-based
navigation  remains  an  especially  difficult  problem.  In  this 
paper,  we  present  an  approach  to  automated  image 
segmentation  and  terrain  classification  using  exemplars, 
or  small  image  samples,  to  represent  the  variety  of 
terrain  appearance. 

Exemplars  are  used  as  cluster  seeds  to  segment  the 
terrain.  Local  pieces  of  terrain  are  assigned  to  the 
exemplar  to  which  they  are  most  similar  in  appearance 
and  inherit  the  terrain  class  membership  of  the 
exemplar.  Exemplar  models  assume  that  intact  stimuli 
are  stored  in  memory,  and  that  classification  or 
recognition  is  determined  by  the  degree  of  similarity 
between  a  stimulus  and  the  stored  exemplars.  Simple 
generalization  effects  explain  correct  classification  of 
novel  (previously  unseen)  instances  of  categories.  Only 
the  item  information  is  used  for  classification  decisions. 
Categorization  relies  on  the  comparison  of  a  new 
stimulus  with  known  exemplars  of  the  category. 


Exemplar  models  are  the  most  parsimonious  models  of 
categorization  in  terms  of  the  underlying  associative 
mechanism [1]. Exemplar-based learning was originally
proposed  as  a  model  of  human  learning  in  Ref.  [2],  and 
has  since  been  shown  to  explain  both  human  and  animal 
visual  classification  performance  significantly  better  than 
alternative  hypotheses  of  feature-based  and  prototype- 
based  processing  [3,4]. 

Various  researchers  have  begun  to  develop  methods  to 
forecast  traversability  using  estimates  of  geometrical 
properties inferred from non-contact sensors.
References  [5]  and  [6]  developed  a  fuzzy-rule-based 
system  to  mimic  human  “high/medium/low”  trafficability 
assessment  based  on  measures  of  roughness,  slope 
and  distance  between  obstacles  computed  from  stereo 
imagery.  The  system  was  targeted  for  planetary  rover 
environments.  Reference  [7]  used  a  stereo  color  vision 
system  together  with  a  single  axis  LADAR  to  classify 
terrestrial  terrain  cover  and  detect  obstacles.  They  noted 
that  the  color-based  classification  system  could  be  made 
more  robust  by  considering  texture  of  regions  and  shape 
features  of  objects.  Reference  [8]  defined  a  trafficability 
index  equal  to  the  weighted  sum  of  the  slope  and 
roughness  estimated  from  line-scanning  laser 
rangefinder  data.  Reference  [9]  classified  terrain  as 
impassable (NoGo) if any of several properties were
above  a  threshold:  height  variation,  the  surface  normal 
orientation,  and  the  presence  of  an  elevation 
discontinuity  (all  estimated  from  LADAR  imagery). 
Reference  [10]  developed  a  rule-based  system  for 
terrain  classification  from  LADAR  and  color  camera 
imagery. 

Appearance-based approaches do not attempt to directly
estimate geometrical properties and then infer
traversability. Instead, they associate the operator’s
assessment of trafficability directly with the terrain
appearance.  The  operator’s  trafficability  assessment  is 
not  restricted  to  geometrical  properties,  but  can  also 
reflect  surface  properties  (e.g.,  friction,  resistance, 
sinkage) and factors that do not affect traversability but
which  nonetheless  exclude  certain  terrain  (e.g.,  the  risk 
of  being  run  over  by  a  car  or  the  need  to  avoid  detection 
by  staying  in  shaded  areas). 

Various  applications  could  benefit  from  automatic 
methods  to  segment  and  classify  terrain  from  images, 
such  as  virtual  reality  simulated  terrain,  mobile  robot 
navigation,  combat  engineering  planning,  and  land  cover 
analysis  for  ecological  studies.  These  applications 
address  different  scales,  terrain  features  and  classes  of 
interest.  It  is  unlikely  that  any  specific  segmentation  and 
classification  criteria  would  be  suitable  for  all  of  these 
applications.  Nonetheless,  the  applications  have 
important  similarities.  In  all  cases,  we  implicitly  assume 
that  local  areas  with  similar  appearance  should  be 
grouped  together  in  any  segmentation,  and  that  they  are 
likely  to  be  representatives  of  the  same  terrain  class.  We 
also  implicitly  assume  that  we  know  in  advance  what 
terrain  classes  we  are  interested  in  and  what  they 
commonly  look  like.  For  the  purposes  of  this  research, 
we  assume  that  the  segmented  terrain  regions  or 
regions  of  the  same  terrain  class  do  not  have  any  a  priori 
constraints  on  their  geometric  shape  or  global 
organization.  We  also  assume  that  there  are  no  a  priori 
constraints  regarding  which  terrain  classes  can  be 
adjacent  to  each  other. 


Fig. 1: Input training image and classification.


The  approach  is  currently  implemented  as  a  software 
system  designed  to  provide  considerable  flexibility  in  the 
choices  of  perspective  transformation,  resolution,  scale, 
sampling  and  difference  metric.  In  general,  different 
choices  will  be  appropriate  for  different  applications.  The 
software  automatically  builds  a  characteristic  “basis  set” 
of  exemplars  from  training  images.  It  provides  an  option 
for  building  a  set  of  exemplars  for  each  terrain  class, 
with  the  union  over  the  terrain  classes  being  the  basis 
set  exemplars  for  an  application.  A  second  option  is  to 
build  a  set  of  terrain  segmentation  exemplars 
independent  of  the  terrain  classes,  and  then  associate 
the  exemplars  with  terrain  classes.  In  its  present  form, 
the software does not attempt to resolve ambiguities
when an area does not resemble any of the a priori
terrain classes, or when an area has partial membership
in two or more terrain classes. Instead, it produces a fuzzy
classification,  i.e.,  a  segment  of  terrain  can  have  partial 
membership  in  different  terrain  classes,  and  may  be 
partially  unclassified. 


II.  Technical  Approach 

The  algorithm  is  organized  into  two  routines:  one  for 
training  and  one  to  apply  segmentation  and 
classification.  At  the  end  of  training,  the  exemplar  bank 
and  associated  data  are  stored  in  a  file  to  be  loaded 
before  applying  the  segmentation  and  classification. 

A. Training Images and Overlays

The  user  must  provide  a  set  of  representative  training 
images.  Ideally,  the  training  images  would  be  drawn 
from  the  same  distribution  as  the  downstream 
application  images.  In  practice,  it  may  not  be  possible  to 
ensure  this.  The  effect  on  segmentation  and 
classification  performance  of  different  terrain,  foliage, 
season,  lighting,  and  weather  between  the  training 
image  set  and  test/application  image  set  is  a  question  for 
empirical  investigation.  In  principle,  the  images  can  be 
multi-spectral  with  an  arbitrary  number  of  planes. 
Currently,  the  software  assumes  that  the  images  are 
RGB  or  monochrome  images  stored  in  a  standard  image 
format. 

For  each  training  image,  a  corresponding  terrain 
classification  overlay  is  required  that  denotes  which 
locations  correspond  to  which  terrain  class.  One 
approach  is  to  use  an  N  plane  image,  where  N  is  the 
number  of  terrain  classes  and  each  plane  is  a  binary 
image.  An  alternative  approach  is  to  use  a  single  plane 
with  integer  values  from  1  to  N  (for  the  N  terrain 
classes),  and  zero  for  unclassified  locations.  This 
representation  is  more  appropriate  when  there  are  a 
large  number  of  terrain  classes,  or  when  the  terrain 
classes  constitute  an  ordered  set,  e.g.,  ordered  by 
traversability cost or by speed-made-good. For
purposes  of  demonstration,  we  use  two  terrain  classes 
(e.g.,  “Go”  and  “NoGo”  regions)  and  the  overlays  are 
stored  as  three-plane  RGB  images  (the  third  plane  is  not 
used).  The  terrain  classification  is  displayed  as  an  RGB 
image  in  which  one  terrain  class  is  coded  red  and  the 
other  is  coded  green,  with  blue  used  to  code  unclassified 
regions.  An  example  of  this  is  shown  in  Fig.  1,  where 
the  gravel  driveway  is  designated  as  a  “Go”  region  and 
everything  else  is  designated  a  “NoGo”  region. 
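
The two overlay representations described above carry the same information and can be converted into one another. The following is a minimal sketch, assuming overlays are held as nested Python lists; the function names `planes_to_labels` and `labels_to_planes` are hypothetical and do not come from the software described in this paper.

```python
def planes_to_labels(planes):
    """Convert N binary planes (2-D lists of 0/1) into one integer label plane.

    Labels run from 1 to N; 0 marks locations that no plane claims.
    If several planes claim a pixel, the first one wins (an arbitrary choice).
    """
    rows, cols = len(planes[0]), len(planes[0][0])
    labels = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for k, plane in enumerate(planes, start=1):
                if plane[r][c]:
                    labels[r][c] = k
                    break
    return labels


def labels_to_planes(labels, n_classes):
    """Convert an integer label plane (0 = unclassified) into N binary planes."""
    return [[[1 if value == k else 0 for value in row] for row in labels]
            for k in range(1, n_classes + 1)]
```

With two classes, plane 1 might hold the “Go” mask and plane 2 the “NoGo” mask; locations left at zero in the label plane correspond to the unclassified regions.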

B. Perspective Transformation, Resolution, Scale and Sampling

In  some  cases,  a  transformation  from  original  camera 
perspective  may  be  appropriate.  In  the  camera  image 
view,  pixels  represent  the  same  angle  (assuming  lens 
distortion  effects  are  minimal),  but  do  not  project  onto 
equal  areas  of  ground.  This  is  problematic  since  terrain 
appearance  changes  with  range  and  thus,  would  require 
multiple  instances  of  the  same  terrain  for  training  (at 
different  ranges). 


Assuming  the  elevation  of  the  camera  is  large  relative  to 
the  variation  in  ground  elevation  in  the  scene,  the 
pseudo  plan  view  projection  can  be  used  to  create  a  new 
image  in  which  each  pixel  corresponds  to  the  same 
ground  area  (see  Fig.  2).  The  pseudo  plan  view 
projection  is  good  for  areas  where  the  variation  in 
elevation  is  small  relative  to  the  elevation  of  the  camera, 
but  produces  distortion  when  this  is  not  the  case.  An 
alternative  projection  is  to  restrict  analysis  to  horizontal 
sub-bands  within  the  image.  The  band  view  does  not 
distort  vertical  objects,  but  retains  the  perspective 
distortion  of  the  original  camera  image  for  flat  earth 
regions.  A  third  alternative  is  to  use  a  stereovision 
camera  to  measure  range  and  warp  the  image 
accordingly,  such  that  each  image  chip  roughly 
corresponds  to  equal  areas  of  ground. 

Both the pseudo plan view and camera band view options are
supported  in  the  current  software.  Both  transformations 
require  the  size  of  the  camera  image,  and  the  angle 
subtended  by  an  individual  pixel  (we  assume  square 
pixels).  The  pseudo  plan  view  projection  requires  three 
additional  inputs:  (1)  the  height  of  the  camera  above 
ground  plane,  (2)  the  distance  on  the  ground  from  the 
spot  below  the  camera  to  the  ground  projection  of  the 
bottom  row  of  the  image,  and  (3)  the  desired  resolution 
of  the  projected  image,  i.e.  the  pixel  width  of  the  output 
projection  in  centimeters. 
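
To make the geometry concrete, the sketch below relates image rows to ground distance under the flat-earth assumption, using the inputs listed above (camera height, ground distance to the bottom image row, and per-pixel angle). It is an illustrative reading of the projection, not the paper's implementation: it assumes every row subtends the same angle, ignores camera roll and lens distortion, and the function names are hypothetical.

```python
import math


def row_ground_distance(rows_above_bottom, camera_height,
                        bottom_row_distance, pixel_angle):
    """Ground distance viewed by an image row, assuming flat terrain.

    Distances use the same unit as camera_height; pixel_angle is in radians.
    Returns None for rows at or above the horizon.
    """
    # Depression angle (below horizontal) of the ray through the bottom row.
    depression_bottom = math.atan2(camera_height, bottom_row_distance)
    # Each row further up the image looks pixel_angle closer to horizontal.
    depression = depression_bottom - rows_above_bottom * pixel_angle
    if depression <= 0.0:
        return None
    return camera_height / math.tan(depression)


def source_row_for_distance(distance, camera_height,
                            bottom_row_distance, pixel_angle):
    """Inverse mapping: fractional camera row (counted up from the bottom)
    that views the given ground distance."""
    depression_bottom = math.atan2(camera_height, bottom_row_distance)
    depression = math.atan2(camera_height, distance)
    return (depression_bottom - depression) / pixel_angle
```

A plan-view image could then be filled by stepping through ground distances at the desired output resolution and sampling the camera row returned by `source_row_for_distance`.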


Fig. 2: Camera image view and pseudo plan view.

The  camera  band  view  also  requires  three  additional 
inputs:  (1)  the  image  row  number  of  the  top  row  of  the 
band,  (2)  the  image  row  number  of  the  bottom  row  of  the 
band,  and  (3)  the  resolution  for  the  band-view  image  (the 
angle  of  pixels  in  the  band  view  image  must  be  less  than 
or  equal  to  the  pixel  angle  of  the  original  camera  image). 

The  user  must  also  specify  the  analysis  scale  for  terrain 
segmentation  and  classification.  The  segmentation  and 
classification  is  based  on  exemplar  image  chips  (square 
chips  in  the  current  software).  The  scale  is  the  width  of 
the  exemplar  chips.  Membership  in  a  terrain  class  is 
considered  to  be  a  bulk  property  of  a  local  region,  not  a 
point-location property. The user must also specify the
center-to-center spacing, or sampling distance, for the
output segmentation and classification images.
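
As an illustration of how scale and sampling distance interact, the sketch below chops a single-plane image into square chips. It assumes the image is a nested list of pixel values and that both parameters are given in pixels; `chop_into_chips` is a hypothetical name.

```python
def chop_into_chips(image, scale, spacing):
    """Return (row, col, chip) tuples for square chips of width `scale`
    whose top-left corners are `spacing` pixels apart.

    Chips that would extend past the image border are skipped.
    """
    rows, cols = len(image), len(image[0])
    chips = []
    for r in range(0, rows - scale + 1, spacing):
        for c in range(0, cols - scale + 1, spacing):
            chip = [line[c:c + scale] for line in image[r:r + scale]]
            chips.append((r, c, chip))
    return chips
```

When the spacing is smaller than the scale, neighboring chips overlap, so the output images are sampled more densely than the chip width.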

C.  Image  Space  Transformation 

The  purpose  of  the  image  space  transformation  is  to 
amplify  the  importance  of  selected  image  properties.  For 
example,  the  imagery  can  be  transformed  into  a  variety 
of  color  spaces.  The  importance  of  color  could  be 
strengthened  or  weakened  by  weighting  different  image 
planes.  In  addition  to  the  RGB  color  coordinate  system, 
we  have  experimented  with  the  HSV  (hue,  saturation, 
value)  and  L*a*b*  (luminance,  red/green,  yellow/blue) 
systems. 
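
A plane-weighted color transform of the kind described above might look like the following sketch. It uses the standard-library `colorsys.rgb_to_hsv` conversion and assumes RGB values scaled to the range 0 to 1; the function name and the default weights are illustrative assumptions only.

```python
import colorsys


def to_weighted_hsv(rgb_pixels, weights=(1.0, 1.0, 1.0)):
    """Convert (r, g, b) pixels in [0, 1] to HSV and scale each plane.

    A weight below 1 weakens that plane's contribution to any later
    difference metric; a weight above 1 strengthens it.
    """
    weighted = []
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        weighted.append((h * weights[0], s * weights[1], v * weights[2]))
    return weighted
```

For example, halving the weight on the value plane reduces sensitivity to overall brightness relative to hue and saturation.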

Another  transformation  option  is  to  adjust  the  high 
spatial  frequency  content  relative  to  low  spatial 
frequency  content  by  constructing  a  multi-resolution 
pyramid  representation  and  then  applying  weights  to  the 
image planes. A common example is the
Laplacian-of-Gaussian spatial bandpass pre-filtering often used in
stereo-vision  processing. 

The  space  transformation  could  increase  the 
dimensionality  of  the  image  space.  Consider  a 
monocular  image  input.  The  image  could  be  processed 
through  a  bank  of  N  spatial  filters,  such  as  edge  and 
corner  filters  at  different  spatial  scales  and  orientations. 
Each  filter  produces  a  single-plane  output  image. 

D.  The  Exemplar  Basis  Set 

The  current  software  processes  the  training  images  one 
at  a  time.  There  is  an  option  to  find  exemplars  of  each 
image  independent  of  exemplars  from  other  images,  or 
to  find  only  new  exemplars  sufficiently  different  from 
exemplars  built  from  preceding  images.  The  current 
image  is  chopped  into  chips  at  the  specified  scale  and 
sampling  distance.  If  the  option  was  selected  to  process 
the  image  independently  from  previous  images,  all  chips 
are  nominated  as  potential  exemplars.  If  the  exemplar 
processing  is  in  the  context  of  previous  exemplars,  only 
chips  whose  minimum  distance  (in  terms  of  the  image 
metric)  to  existing  exemplars  is  greater  than  the  current 
clustering  threshold  are  nominated  as  potential 
exemplars:  chips  that  resemble  current  exemplars  are 
not  considered  as  possible  new  exemplars. 

Each  chip  is  compared  to  its  neighbors  within  a  specified 
radius  to  calculate  the  difference  metric  between  it  and 
each  of  its  neighbors  (the  radius  is  a  user  input).  The 
aggregate  local  difference  between  the  chip  and  its 
neighbors  is  calculated  as  the  weighted  average  of  the 
mean  and  minimum  differences  (The  weight  is  a  user 
input.  Weighting  towards  the  minimum  leads  to  a  larger 
pool  of  exemplars,  and  weighting  towards  the  mean 
leads to a smaller pool of exemplars). Chips similar to
their neighbors are preferred over those that are
different.
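
The aggregate local difference can be written compactly. This is a minimal sketch, assuming the weight lies between 0 and 1 and is applied to the mean (with the remainder applied to the minimum); `local_difference` is a hypothetical name.

```python
def local_difference(neighbor_differences, mean_weight):
    """Weighted average of the mean and minimum difference to neighbors.

    mean_weight = 1.0 uses only the mean; mean_weight = 0.0 uses only
    the minimum.
    """
    mean_diff = sum(neighbor_differences) / len(neighbor_differences)
    min_diff = min(neighbor_differences)
    return mean_weight * mean_diff + (1.0 - mean_weight) * min_diff
```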

The  algorithm  calculates  a  clustering  threshold  equal  to 
the  weighted  sum  of  the  minimum  and  maximum  local 
differences  over  all  chips  (The  weight  is  a  user  input. 
Weighting  towards  the  minimum  leads  to  a  larger  pool  of 
exemplars  and  tighter  clusters.  Weighting  towards  the 
maximum  leads  to  a  smaller  pool  of  exemplars  and 
broader  clusters).  This  threshold  provides  the  system’s 
adaptation ability. Training images with significant
variability provide coarser segmentation than training
images with lower variability, for the same size of
exemplar bank.
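
One plausible form of the clustering threshold is sketched below. The paper says only that it is a weighted sum of the minimum and maximum local differences; treating the weights as a convex combination controlled by a single parameter is an assumption, as is the name `clustering_threshold`.

```python
def clustering_threshold(local_differences, max_weight):
    """Blend the smallest and largest local difference over all chips.

    max_weight near 0 gives a tight threshold (more exemplars);
    max_weight near 1 gives a loose threshold (fewer exemplars).
    """
    smallest = min(local_differences)
    largest = max(local_differences)
    return (1.0 - max_weight) * smallest + max_weight * largest
```

Because the threshold is derived from the range of local differences in the training image itself, an image with more variability yields a larger threshold for the same weight.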

Exemplars  for  the  current  image  are  selected  iteratively. 
Initially,  no  chips  are  rejected.  Of  the  non-rejected  chips, 
the  one  with  the  minimum  local  difference  is  added  to  the 
bank  of  exemplars.  All  chips  with  difference  less  than 
the  clustering  threshold  from  the  exemplar  are  rejected. 
This  process  is  iterated  until  all  chips  have  either  been 
added  to  the  exemplar  bank  or  rejected.  The  exemplars 
for  the  current  image  are  then  merged  with  the  bank  of 
exemplars  from  the  previous  images. 
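
The iterative selection just described can be sketched as a greedy loop. The code assumes the local differences and the threshold have already been computed and that a `distance` function implementing the chosen chip difference metric is supplied by the caller; all names are hypothetical.

```python
def select_exemplars(chips, local_diffs, threshold, distance):
    """Greedy exemplar selection; returns indices of the chosen chips.

    Repeatedly promotes the remaining chip with the smallest local
    difference, then rejects every remaining chip closer to it than
    the clustering threshold.
    """
    remaining = set(range(len(chips)))
    exemplars = []
    while remaining:
        # Ties are broken by chip index so the result is deterministic.
        best = min(remaining, key=lambda i: (local_diffs[i], i))
        exemplars.append(best)
        remaining.discard(best)
        remaining = {i for i in remaining
                     if distance(chips[i], chips[best]) >= threshold}
    return exemplars
```

Each pass removes at least one chip, so the loop finishes after at most as many iterations as there are nominated chips.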

E.  Image  Chip  Difference  Metric 

Image  difference  metrics  remain  an  open  issue  in  the 
evaluation  of  image  compression  schemes.  While  it  is 
easy  to  measure  the  amount  of  compression  and  the 
encoding/decoding  time,  it  is  not  clear  how  to  measure 
the  quality  of  the  reconstructed  image,  i.e.,  its  difference 
in  appearance  from  the  original.  Different  image 
characteristics  are  important  depending  on  the  image 
content,  the  questions  at  hand,  and  who  is  looking  at  the 
image. 


Fig. 3: Reconstruction of training image from
exemplars and resulting classification.

Similarly,  there  is  no  obviously  correct  metric  for 
measuring  the  difference  between  two  images.  Before 
the  images  are  chopped  into  chips,  they  can  be 
processed  to  balance  the  relevant  image  characteristics 
(see Section II.C, Image Space Transformation). In principle,
therefore,  simple  measures  of  the  aggregate  difference 
are  all  that  are  needed.  Even  so,  there  are  many 
different  ways  to  calculate  the  difference  between  two 
image  chips.  Some  metrics  are  computed  from  the 
pixel-by-pixel  difference  between  two  chips,  others  are 
calculated  from  the  difference  in  statistics  computed 
from  the  individual  chips,  e.g., 


•  the  sum  over  all  pixel  locations  and  all  image  planes 
of  the  absolute  value  of  the  difference  between  the 
two  images; 

•  the  root  sum  square  over  all  pixel  locations  and  all 
image  planes  of  the  difference  between  the  two 
images; 

•  the  maximum  over  all  image  planes  of  the  sum  over 
all  pixel  locations  of  the  absolute  value  of  the 
difference  between  the  two  images; 

•  the  sum  over  all  pixel  locations  of  the  maximum  over 
all  image  planes  of  the  absolute  value  of  the 
difference  between  the  two  images; 

•  the  root  sum  square  over  all  image  planes  of  the 
difference  in  the  mean  values  and  difference  in 
standard  deviations  (over  pixel  locations)  of  the  two 
images;  and 

•  the  sum  over  all  image  planes  of  the  absolute 
difference  in  the  mean  values  and  difference  in 
standard  deviations  (over  pixel  locations)  of  the  two 
images. 

The  first  four  metrics  are  computed  from  pixel-by-pixel 
differences  of  the  image  chips,  while  the  last  two  metrics 
are  computed  from  statistics  of  the  image  chips. 
Although the software is set up to incorporate different
metrics,  the  results  in  this  paper  are  based  on  the  first 
and  last  metrics. 
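
For concreteness, the first and last metrics in the list might be implemented as follows. The sketch assumes each chip is a list of image planes and each plane is a flat list of pixel values; the use of the population standard deviation is an assumption, and the function names are hypothetical.

```python
import statistics


def sum_abs_difference(chip_a, chip_b):
    """First metric: sum over all planes and pixel locations of the
    absolute pixel-by-pixel difference."""
    return sum(abs(a - b)
               for plane_a, plane_b in zip(chip_a, chip_b)
               for a, b in zip(plane_a, plane_b))


def stat_abs_difference(chip_a, chip_b):
    """Last metric: sum over planes of the absolute difference in means
    plus the absolute difference in standard deviations."""
    total = 0.0
    for plane_a, plane_b in zip(chip_a, chip_b):
        total += abs(statistics.fmean(plane_a) - statistics.fmean(plane_b))
        total += abs(statistics.pstdev(plane_a) - statistics.pstdev(plane_b))
    return total
```

The statistical metric needs only two numbers per plane for each exemplar, whereas the pixel-by-pixel metric needs the whole chip.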

F. Exemplar Membership in Terrain Classes

Each  image  chip  maps  to  a  region  in  the  terrain 
classification  overlay.  The  terrain  classification  of  the 
image  chip  is  simply  the  expected  membership  in  each 
of  the  terrain  classes.  It  is  possible  that  a  chip  could 
straddle  more  than  one  terrain  class,  or  could  straddle 
an  unclassified  portion  of  the  overlay.  After  the  new 
exemplars  are  added  to  the  exemplar  bank,  the  current 
image  is  segmented  using  all  of  the  exemplars  in  the 
bank.  Each  chip  location  in  the  image  is  assigned  to  the 
exemplar  to  which  it  is  closest,  provided  the  distance  is 
less  than  the  current  clustering  threshold.  In  some 
cases,  some  image  chips  may  not  be  associated  with 
any  exemplar.  For  each  exemplar  in  the  bank,  we 
accumulate the number of times the exemplar is “hit” by
an image chip. The terrain class membership of the exemplar
is  the  mean  over  all  chips  associated  with  the  exemplar, 
of  terrain  class  memberships  of  the  chips.  The  terrain 
segmentation  is  converted  to  terrain  classification  by 
assigning  each  location  the  terrain  class  membership 
values  of  the  exemplar  associated  with  that  image 
location. 
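
The averaging step can be sketched as below. It assumes each chip already carries a tuple of class memberships read from the overlay and an exemplar index (or `None` when no exemplar is within the threshold); the function name and data layout are illustrative assumptions.

```python
def exemplar_memberships(chip_memberships, assignments, n_exemplars):
    """Mean class membership and hit count for each exemplar.

    chip_memberships[i] is a tuple of per-class memberships for chip i;
    assignments[i] is the index of its exemplar, or None if unassigned.
    Exemplars with no hits get all-zero membership.
    """
    n_classes = len(chip_memberships[0])
    sums = [[0.0] * n_classes for _ in range(n_exemplars)]
    hits = [0] * n_exemplars
    for membership, exemplar in zip(chip_memberships, assignments):
        if exemplar is None:
            continue
        hits[exemplar] += 1
        for k, value in enumerate(membership):
            sums[exemplar][k] += value
    means = [[s / h for s in row] if h else [0.0] * n_classes
             for row, h in zip(sums, hits)]
    return means, hits
```

An exemplar hit by chips from both classes ends up with fractional membership in each, which is how the fuzzy classification arises.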

G.  Output  Illustration  Controls 

The  algorithm  contains  options  to  output  different  images 
to  illustrate  and  provide  insight  into  the  processing: 


• the pseudo plan view or camera band view
perspective transformation of the image;

• the pseudo plan view or camera band view
perspective transformation of the terrain class
overlay;

•  the  exemplar  chips  (at  their  location  in  the  image) 
selected  from  the  current  image; 

•  the  segmentation  of  the  current  image  based  on  the 
current  bank  of  exemplars;  and 

•  the  classification  of  the  image  based  on  the  current 
bank  of  exemplars. 

There  is  no  obvious  and  correct  way  to  represent  the 
different segments for purposes of visualization.
Color-coding shows the different segments, but does not give
much  insight  into  the  basis  for  the  segmentation.  The 
software  illustrates  the  segmentation  in  a  way  that 
provides  direct  visual  insight  into  the  basis  for  the 
segmentation.  To  visualize  the  segmentation,  the 
software  replaces  each  image  chip  with  the  exemplar 
chip  that  it  is  associated  with  (image  chips  not 
associated  with  any  exemplar  appear  black)  (See  Fig. 
3).  When  the  sampling  distance  is  less  than  the 
exemplar  scale,  the  exemplars  are  blended  in  the 
reconstruction.  The  visualization  image  is  the  same  size 
as  the  pseudo  plan  view  or  camera  band  view 
perspective  image,  so  it  is  easy  to  directly  compare  the 
two.  By  using  the  exemplar  chips  themselves,  the 
visualization  image  shows  what  the  exemplars  look  like, 
and  which  image  chips  they  are  associated  with.  Finally, 
comparing  the  visualization  to  the  perspective  image 
gives prima facie evidence of the credibility of the
segmentation. 


Fig.  4:  Test  images,  reconstruction  from 
exemplars,  and  resulting  classification.  (One  RGB 
training  image) 


H.  Application  for  Segmentation  and  Classification 


The application routine reads in the exemplar bank and
associated  data  produced  by  the  training  routine.  It 
segments  and  classifies  the  test  images  one  at  a  time. 
No  changes  are  made  to  the  exemplar  bank  or 
associated data. After pseudo plan view or camera band
view perspective processing, the test image is chopped
into chips at the specified scale and sampling distance.
Each image chip is assigned to the closest matching
exemplar, provided the match is within the current
clustering threshold; otherwise the chip is unassigned.
This  produces  the  segmentation  by  exemplars.  After  the 
segmentation,  each  location  is  assigned  the  terrain  class 
fuzzy  membership  of  the  segmenting  exemplar.  The 
classification  image  is  at  the  resolution  of  the  center-to- 
center  sampling  distance. 
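
The application step for a single chip can be sketched as follows, assuming the exemplar bank, the per-exemplar memberships, and the threshold have been loaded from the training output. The `distance` argument stands for whichever chip difference metric was used in training; all names are hypothetical.

```python
def classify_chip(chip, exemplars, memberships, threshold, distance):
    """Return (exemplar_index, class_memberships) for one chip.

    Returns (None, None) when no exemplar lies within the threshold,
    which corresponds to an unassigned (unclassified) chip.
    """
    best_index, best_dist = None, None
    for i, exemplar in enumerate(exemplars):
        d = distance(chip, exemplar)
        if best_dist is None or d < best_dist:
            best_index, best_dist = i, d
    if best_index is None or best_dist >= threshold:
        return None, None
    return best_index, memberships[best_index]
```

The exemplar bank is only read here, never modified, in keeping with the description above.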


Fig.  5:  Test  images,  reconstruction  from 
exemplars,  and  resulting  classification.  (Two  RGB 
training  images) 


III.  Demonstration  Results 


This  section  illustrates  the  segmentation  and 
classification system. The demonstration uses color-coding
to show the terrain classification into Go (green),
NoGo  (red),  and  Unclassified  (blue)  regions.  Fig.  4 
shows  classification  results  derived  from  the  single 
training  image  in  Fig.  1,  where  gravel  is  designated  “Go” 
and  everything  else  is  “NoGo.”  This  training  resulted  in 
25  exemplars.  Note  the  errors  due  to  the  building  in  the 
upper  image  and  in  the  lower  image  due  to  the 
shadowed gravel. Adding a second training image
similar to the lower image in Fig. 4 yields the
classification results of Fig. 5, with 78 exemplars. Note
the  overall  improvement  in  the  shadowed  region  and  in 
the  grassy  areas.  However,  the  upper  image 
classification  has  become  noisier. 

To  compensate  for  different  lighting  conditions,  we 
turned  to  the  HSV  (hue,  saturation,  value)  color  space. 
Although  this  resulted  in  some  improvements,  at  the 
expense  of  more  exemplars,  the  HSV  system  is 
unsatisfactory  due  to  the  cyclical  nature  of  hue  and  the 
fact  that  HSV  is  far  from  perceptually  uniform.  This  led  to 
the  implementation  of  an  L*a*b*  color  space  transform, 
where  L*  refers  to  luminance  and  the  a*  and  b* 
components  encode  the  color  information.  The 
transformation  to  L*a*b*  is  nonlinear,  resulting  in 
components  that  are  nearer  to  perceptually  uniform. 

Figure  6  shows  the  results  of  training  the  algorithm  with 
images transformed to the L*a*b* color space. The
upper image is similar to the RGB classification, while
the  lower  image  is  much  improved.  However,  the 
number  of  exemplars  has  increased  by  a  factor  of  two  to 
172.  Note  that  the  images  in  Fig.  6  are  from  the  original 
RGB  color  space,  not  the  L*a*b*  color  space,  as  the 
latter  is  more  difficult  to  interpret. 


Fig.  6:  Test  images,  reconstruction  from 
exemplars,  and  resulting  classification.  (Two 
L*a*b*  training  images) 


Color  alone  is  not  always  a  good  indication  of  image 
matching,  and  therefore  we  have  also  included  texture 
as  an  additional  dimension  on  which  to  differentiate  and 
compare  image  exemplars.  Figure  7  shows  the  results  of 
adding  a  texture  plane,  computed  by  calculating  the 
standard  deviation  over  a  sliding  window  throughout  the 
image.  The  classification  is  smoother,  but  not 
significantly  better  than  without  texture  on  these  two 
images  and  the  number  of  exemplars  has  increased  to 
278. 
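
A texture plane of this kind could be computed as in the sketch below. The window shape, its size, the handling of image borders (windows are simply clipped), and the use of the population standard deviation are all assumptions; the paper does not specify them.

```python
import statistics


def texture_plane(plane, half_width):
    """Standard deviation of pixel values in a square sliding window.

    The window spans (2 * half_width + 1) pixels per side and is clipped
    at the image borders. `plane` is a 2-D list of pixel values.
    """
    rows, cols = len(plane), len(plane[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [plane[rr][cc]
                      for rr in range(max(0, r - half_width),
                                      min(rows, r + half_width + 1))
                      for cc in range(max(0, c - half_width),
                                      min(cols, c + half_width + 1))]
            out[r][c] = statistics.pstdev(window)
    return out
```

The resulting plane can be appended to the color planes so that the difference metric also responds to local roughness.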


Fig.  7:  Test  images,  reconstruction  from 
exemplars,  and  resulting  classification.  (Two 
L*a*b*  training  images  with  texture) 


All  the  preceding  analysis  was  performed  using  a 
difference  metric  based  on  computing  the  pixel-by-pixel 
difference  between  the  image  chips.  There  is  also  the 
option  of  computing  statistics  on  each  image  chip  and 
then  computing  the  difference  between  the  statistics. 
Figure  8  shows  the  results  of  that  analysis,  which 
required  only  75  exemplars,  similar  to  the  previous  RGB 
classification  with  no  texture,  but  with  much  better 
classification accuracy. In this example, the computed
distance metric was the sum of the absolute differences
of the mean and standard deviation over each image
plane.


Fig.  8:  Test  images,  reconstruction  from 
exemplars,  and  resulting  classification.  (Two 
L*a*b*  training  images  with  texture  and  using 
statistical  differences) 

The  low  number  of  features  allows  the  possibility  of 
using  other  learning  algorithms  such  as  neural  networks, 
support  vector  machines,  or  the  various  clustering 
methods.  Memory  requirements  are  also  reduced  since 
only  the  statistics  of  the  exemplar  are  stored,  not  the 
entire  chip. 

IV.  Comparison  to  Other  Techniques 

A. Fuzzy Clustering

To  compare  our  online  classification  methodology  to 
other  techniques,  we  turned  to  a  more  realistic  and 
difficult  problem  using  the  same  set  of  images.  Instead 
of  segmenting  out  gravel  versus  everything  else,  we 
segmented  out  gravel  and  grass  versus  everything  else. 
The  data  consisted  of  two  image  sequences.  Figure  9 
shows  the  two  training  images,  which  are  the  same  as 
for  the  preceding  analysis,  and  their  associated 
segmentation  masks.  We  chose  23  other  images  from 
the  two  image  sequences  to  test  the  algorithms,  which 
required  hand  drawing  classification  maps  for  each  of 
the  test  images. 


Fig.  9:  Input  training  images  and  classification  for 
extended  comparison. 


Since  the  online  algorithm  produces  exemplars  that  are 
essentially  cluster  centers,  it  is  natural  to  compare  the 
performance  against  a  standard  clustering  algorithm, 
such  as  fuzzy  c-means  clustering  (FCM)  [11].  Since  the 
difference  metric  uses  the  absolute  difference  between 
feature  vectors,  while  the  FCM  algorithm  computes  a 
root-mean-square  difference,  we  took  the  square  root  of 
the  feature  vectors  before  passing  them  to  the  FCM 
algorithm.  We  also  replaced  the  cluster  centers, 
computed  by  the  FCM  algorithm,  with  the  closest  feature 
vector  in  order  to  replicate  the  use  of  exemplars  and  to 
compute  the  reconstruction  images.  The  online 
algorithm  analyzes  each  class  separately,  and  then  tests 
each  test  feature  vector  against  exemplars  from  each 
class.  We  replicated  this  behavior  in  the  FCM  algorithm 
by  segmenting  the  two  classes  and  computing  clusters 
for  each  separately.  One  difference  that  we  have  not 
replicated  is  that  the  online  algorithm  currently  uses  an 
unclassified  category  when  an  image  chip  is  too  far  from 
any  exemplar  based  on  an  adaptive  threshold.  This  is 
seen  in  the  blue  chips  in  the  classification  image  and  the 
corresponding  black  chips  in  the  reconstruction  images. 
The  FCM  algorithm  simply  chooses  the  closest 
exemplar. 
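
The exemplar replacement applied to the FCM output can be illustrated with a short sketch. It assumes the cluster centers have already been computed elsewhere, and it adds a helper that counts replaced centers to which no feature vector is assigned as closest. The names and the tie-breaking rule (lowest index wins) are assumptions.

```python
def nearest_index(vector, candidates, distance):
    """Index of the candidate closest to `vector` (lowest index on ties)."""
    return min(range(len(candidates)),
               key=lambda i: distance(vector, candidates[i]))


def replace_centers(centers, features, distance):
    """Replace each cluster center with its closest actual feature vector.

    Two centers may map to the same feature vector, producing duplicates.
    """
    return [features[nearest_index(c, features, distance)] for c in centers]


def count_unused(exemplars, features, distance):
    """Number of exemplars that are never the closest match to any
    feature vector."""
    used = {nearest_index(f, exemplars, distance) for f in features}
    return len(exemplars) - len(used)
```

Because the nearest exemplar is always accepted here, no chip is left unclassified, unlike the thresholded assignment used by the online algorithm.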


Fig.  10:  Examples  of  test  images  and  hand-drawn 
classification  maps. 

We  also  implemented  a  metric  to  compare  the  output 
classification  mask  to  a  user-drawn  mask.  Fully  “Go” 
regions  are  mapped  to  +1,  fully  “NoGo”  regions  are 
mapped  to  -1,  and  unknown  regions  are  mapped  to  0. 
The  chosen  difference  metric  is  the  absolute  difference 
divided  by  the  sum  of  absolute  values.  While  this  metric 
handles  the  fuzzy  classification  in  the  previous  section, 
in  order  to  compare  with  other  methods,  we  defuzzified 
the  classification  map  and  also  mapped  the  unknown 
regions  to  “NoGo,”  which  would  normally  be  done  for 
cautious  driving.  We  modified  the  code  from  Ref.  [12]  for 
our  implementation  of  the  FCM  algorithm. 
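
One reading of the mask comparison metric is sketched below: the summed absolute difference between the two masks, divided by the summed absolute values of both. The paper's wording leaves the exact normalization open, so this form is an assumption, as are the function names and the tie rule in `defuzzify`.

```python
def mask_error(predicted, reference):
    """Normalized difference between two masks coded Go = +1, NoGo = -1,
    unknown = 0 (fuzzy values in between are allowed).

    Returns 0.0 for identical masks and 1.0 for fully opposite masks.
    """
    numerator = sum(abs(p - r) for p, r in zip(predicted, reference))
    denominator = sum(abs(p) + abs(r) for p, r in zip(predicted, reference))
    return numerator / denominator if denominator else 0.0


def defuzzify(go_membership, nogo_membership):
    """Crisp label for one location: +1 (Go) only when Go membership
    dominates; ties and unknown locations fall back to -1 (NoGo)."""
    return 1 if go_membership > nogo_membership else -1
```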

B. Validation

One  issue  with  typical  clustering  algorithms  is  the 
requirement  to  choose  the  number  of  clusters 
beforehand. There are a number of published validation
measures that can be used to find an optimum number.
Two  common  measures  are  the  Partition  Coefficient, 
which  measures  the  amount  of  overlapping  between 
clusters,  and  the  Partition  Entropy,  which  measures  the 
fuzziness of the partitioning [11,12]. However, these tend
to  scale  monotonically  with  the  number  of  clusters  and 
one  must  find  the  ‘knee  in  the  curve’  to  estimate  the 
optimal  number  of  clusters.  This  can  be  problematic 
when  the  data  is  noisy.  Two  other  measures  are  the 
Partition  Index,  which  measures  separation  and 
compactness  of  the  clusters,  and  the  Separation  Index, 
which  uses  a  minimum  distance  rather  than  an  average 
distance  [12].  While  the  latter  measures  provide  a 
clearer  optimality  point,  we  found  that  they  did  not 
always  correlate  well  with  the  measured  training  and  test 
error  in  our  data  (See  Fig.  11).  However,  they  may 
provide  more  accurate  predictions  with  a  larger  training 
set. 
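
For reference, the first two measures can be computed directly from the fuzzy
membership matrix. The sketch below uses their standard definitions (mean
squared membership for the Partition Coefficient, mean membership entropy for
the Partition Entropy) and assumes a membership matrix `u` of shape
(clusters, samples) whose columns sum to one. It is not taken from the code of
Ref. [12].

```python
import numpy as np


def partition_coefficient(u):
    """Mean squared membership: 1 for a crisp partition, 1/c when uniform."""
    u = np.asarray(u, dtype=float)
    n_samples = u.shape[1]
    return float(np.sum(u ** 2) / n_samples)


def partition_entropy(u, eps=1e-12):
    """Mean membership entropy: 0 for a crisp partition, log(c) when uniform."""
    u = np.asarray(u, dtype=float)
    n_samples = u.shape[1]
    # Clip before the log so that zero memberships contribute exactly zero.
    return float(-np.sum(u * np.log(np.clip(u, eps, 1.0))) / n_samples)
```

Because the uniform-membership bounds 1/c and log(c) themselves move with the
number of clusters c, a trend in these values with cluster count is to be
expected.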


Fig. 11: Test error (black), Partition Index (red),
Separation  Index  (blue),  and  Unused  Exemplars 
(green),  as  a  function  of  number  of  training 
clusters. 

The  aforementioned  measures  determine  desirable 
attributes  of  a  clustering  structure  and  measure  how  well 
the  current  clustering  scheme  adheres  to  them.  Instead, 
we  have  discovered  a  method  that  is  specific  to  the 
exemplar  replacement  version  of  FCM  clustering  that  we 
have  implemented  and  is  based  on  a  more  utilitarian 
measure  of  cluster  effectiveness.  We  noticed  that  when 
we  replaced  cluster  centers  with  exemplars,  there  was 
duplication  in  the  identification  of  the  exemplars.  This 
resulted  in  the  situation  where  we  would  specify  a  given 
number  of  clusters  for  training,  but  when  the  cluster 
centers were replaced with exemplars, not all of the exemplars
would be used in classifying the training set. That is,
there  were  unused  exemplars.  As  we  specified  more 
clusters,  the  number  of  unused  exemplars  would 
increase.  We  conjecture  that  it  is  at  the  point  where  one 
starts  seeing  unused  exemplars  that  we  are  near  the 
optimal number of clusters. This is exemplified in Fig. 11
where the number of unused exemplars starts to
increase when the total number of exemplars reaches
about  40.  Note  that  this  is  also  near  the  region  where 
the test error is smallest. This measure gives a
reasonably well defined location to stop adding clusters
to the training and should allow a simple method for
searching for the optimal number of clusters.
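
A minimal sketch of how such a count might be obtained is given below. It
assumes that each cluster center is replaced by its nearest training chip in
feature space and that an exemplar is "unused" when no training chip is
assigned to it; the function names and the Euclidean distance are assumptions
of the sketch, not details of our implementation.

```python
import numpy as np


def count_unused_exemplars(chips, centers):
    """Count exemplars that receive no training chips.

    chips:   (n_chips, n_features) training chip descriptors
    centers: (n_clusters, n_features) cluster centers from FCM
    """
    chips = np.asarray(chips, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Replace each cluster center with its nearest training chip (exemplar).
    d_center = np.linalg.norm(centers[:, None, :] - chips[None, :, :], axis=2)
    exemplars = chips[d_center.argmin(axis=1)]
    # Classify each training chip by its nearest exemplar; argmin resolves
    # ties in favor of the first exemplar, so duplicates go unused.
    d_chip = np.linalg.norm(chips[:, None, :] - exemplars[None, :, :], axis=2)
    used = np.unique(d_chip.argmin(axis=1))
    return len(centers) - len(used)


def first_count_with_unused(unused_by_count):
    """Smallest cluster count with at least one unused exemplar, else None.

    unused_by_count: dict mapping number of clusters to unused exemplars.
    """
    for n_clusters in sorted(unused_by_count):
        if unused_by_count[n_clusters] > 0:
            return n_clusters
    return None
```

A search over cluster counts would train FCM at each candidate count, record
the result of `count_unused_exemplars`, and stop near the value returned by
`first_count_with_unused`.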

C  Comparison  Results

For  our  comparison,  we  chose  to  set  the  number  of 
training  clusters  at  40.  Figures  12  and  13  show  the 
clustering  results  for  the  two  test  images  of  Figure  10. 
Note  that  while  the  clustering  algorithm  tends  to  produce 
more  accurate  classification  maps,  the  differences  are 
not  overly  large.  This  is  borne  out  by  the  group 
classification  accuracy  for  the  two  methods,  where  the 
combined  classification  error  over  the  23  test  images  is 
0.163  with  92  exemplars  for  the  online  method,  while  the 
FCM algorithm had an error of 0.115 with 39 exemplars.

Fig. 12: Test image reconstruction from exemplars
and  resulting  classification  for  the  online  (upper) 
and  clustering  (lower)  methods.  (Two  L*a*b* 
training  images  with  texture) 


Fig.  13:  Test  image  reconstruction  from  exemplars 
and resulting classification for the online (upper)
and  clustering  (lower)  methods.  (Two  L*a*b* 
training  images  with  texture) 

It  appears  that  the  best  choice  for  a  complete  system 
would  be  a  combination  of  the  two  algorithms  employed. 
The  clustering  algorithm  could  be  used  for  the  initial 
offline  training.  The  online  learning  algorithm  would  then 
be  employed  while  the  system  is  running. 


V.  Findings  and  Observations 

This  paper  has  demonstrated  an  approach  to  image- 
based  terrain  segmentation  and  classification  using 
exemplars.  Exemplars  provide  a  simple  way  to  represent 
the  characteristic  color/luminance  and  spatial  patterns  of 
terrain.  Since  the  exemplars  are  drawn  from  training 
images  in  such  a  way  as  to  span  the  appearance  of  the 
training  images,  they  are  well  suited  to  represent  the 
variations  of  appearance  without  an  a  priori  model  of 
terrain  appearance.  The  software  system,  as  presented, 
allows  for  considerable  flexibility  to  specify  the 
perspective  transformation,  image  space  transformation, 
scale,  resolution,  sampling  density,  and  image  difference 
metric.  Empirical  research  is  needed  to  tune  these 
options  for  specific  applications. 

Preliminary  results  indicate  the  approach  has  potential  to 
segment  terrain  in  a  manner  that  is  consistent  with 
subjective  perception.  The  segmentation  appears  to  be 
robust  over  changes  in  lighting,  specific  terrain,  and 
automatic  camera  gain  and  contrast  adjustments.  Our 
previous  results  indicated  that  analysis  in  the  camera 
band  view  was  more  useful  for  segmenting  and 
classifying  positive  obstacles  than  the  pseudo  plan  view. 
When  presented  with  novel  images,  the  camera  band 
view  was  more  likely  to  produce  mixed  Go/NoGo  terrain 
classification,  whereas  the  pseudo  plan  view  was  more 
likely  to  produce  unclassified  terrain  segments.  This  may 
be  due  to  the  fact  that  the  camera  band  view  mixes 
different  scales,  whereas  the  pseudo  plan  view 
maintains  more  consistent  scale. 

The  algorithm  performs  quite  well  on  the  simplistic 
segmentation  of  gravel  from  other  terrain.  When 
presented  with  a  combination  of  both  grass  and  gravel, 
the  system  still  performed  reasonably  well.  Nonetheless, 
the  preliminary  analysis  is  not  adequate  to  assess  the 
value  of  this  method  of  terrain  classification  for  any 
specific application, e.g., robot navigation. More
extensive testing, with structured experimental
objectives and design, is needed to evaluate the
applicability of this method to any such application.
The algorithm is reasonably
fast,  with  the  largest  time  consumption  actually  being  the 
reconstruction  of  the  segmentation  images  by  inserting 
exemplars.  But  this  step  is  for  visualization  purposes 
only.  The  method  presented  here  does  not  address  an 
optimum  method  for  defuzzification,  i.e.,  how  to  make 
discrete  decisions  based  on  the  fuzzy  membership,  and 
does  not  address  how  to  make  discrete  decisions  when 
terrain  class  has  partial  membership  in  the  “unclassified” 
set.  The  research  presented  here  does  not  address  how 
to  combine  results  obtained  by  analysis  at  different 
levels  of  resolution  and/or  scale.  Further  research  in 
these  topics  is  needed,  in  the  context  of  specific 
applications. 


The  online  algorithm  results  compared  favorably  to  the 
results  obtained  from  offline  fuzzy  c-means  clustering. 
Although  the  latter  performed  measurably  better,  it  had 
the  advantage  of  seeing  all  the  data  at  once,  in  contrast 
to the online algorithm, which is also image-order
dependent and is therefore suboptimal. However, the
online  algorithm  does  use  information  about  neighboring 
image  chips  in  making  decisions.  This  information  could 
also  potentially  be  used  in  the  clustering  algorithms.  In 
addition,  we  have  proposed  a  method  for  determining  an 
optimal  number  of  clusters  when  using  exemplar-based 
clustering. 

Additional  future  work  involves  training  and  testing  on  a 
larger  set  of  images,  as  well  as  applying  the  algorithm  to 
video  streams  and  implementing  on  a  mobile  robot. 
Since  terrain  appearance  varies  as  a  function  of 
distance,  fusing  range  data  from  a  stereo  camera 
system with the color and texture information currently
being used is anticipated to provide enhanced
performance.  Range  information  should  also  allow  us  to 
disregard  portions  of  the  image  that  do  not  pertain  to  the 
terrain  and  that  can  generate  a  large  number  of 
exemplars  due  to  their  wide  variety  of  appearance,  such 
as  objects  in  the  sky.  We  will  also  continue  to  explore 
alternative  information  such  as  multi-resolution 
processing  and  structure  filtering. 